Communication Characteristics and Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-core SMP Nodes
Authors
Abstract
Hybrid MPI/OpenMP and pure MPI programming on clusters of multi-core SMP nodes involve several mismatch problems between the parallel programming models and the hardware architectures. Measurements of communication characteristics between cores on the same socket, on the same SMP node, and between SMP nodes on several platforms (including Cray XT4 and XT5) show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. We describe the potentials and challenges of the dominant programming models on hierarchically structured hardware. Case studies with the multi-zone NAS Parallel Benchmarks on several platforms demonstrate the opportunities of hybrid programming.

1. Mainstream HPC architecture

Today, scientists who wish to write efficient parallel software for high performance systems have to face a highly hierarchical system design, even (or especially) on “commodity” clusters (Fig. 1 (a)). The price/performance sweet spot seems to have settled at a point where multi-socket multi-core shared-memory compute nodes are coupled via high-speed interconnects. Inside the node, details like UMA (Uniform Memory Access) vs. ccNUMA (cache-coherent Non-Uniform Memory Access) characteristics, the number of cores per socket and/or ccNUMA domain, shared and separate caches, or chipset and I/O bottlenecks complicate matters further. Communication between nodes usually shows a rich set of performance characteristics as well, because fully non-blocking global interconnects have grown out of the affordable range. This trend will continue into the foreseeable future, broadening the available range of hardware designs even when looking at high-end systems.

Consequently, it seems natural to employ a hybrid programming model which uses OpenMP for parallelization inside the node and MPI for message passing between nodes. However, there is always the option to use pure MPI and treat every CPU core as a separate entity with its own address space. And finally, looking at the multitude of hierarchies mentioned above, the question arises whether it might be advantageous to employ a “mixed model” where more than one MPI process with multiple threads runs on a node, so that there is at least some explicit intra-node communication (Fig. 1 (b)–(d)).

[Fig. 1: SMP nodes with multiple sockets coupled by a node interconnect; (a) the hierarchical hardware layout, (b)–(d) different options for distributing MPI processes and OpenMP threads across sockets and nodes.]

It is not a trivial task to determine the optimal model for a specific application. There seems to be a general lore that pure MPI can often outperform hybrid, but counterexamples do exist, and results tend to vary with input data, problem size, etc., even for a given code [1]. This paper discusses potential reasons for this; in order to get optimal scalability one should in any case try to implement the following strategies: (a) reduce synchronization overhead (see Sect. 3.5), (b) reduce load imbalance (Sect. 4.2), (c) reduce computational overhead and memory consumption (Sect. 4.3), and (d) minimize MPI communication overhead (Sect. 4.4). There are some strong arguments in favor of a hybrid model which tend to support the assumption that it should lead to improved parallel efficiency as compared to pure MPI. In the following sections we will shed some light on these arguments.
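To make the hybrid model concrete, the sketch below shows the structure argued for above: OpenMP threads share the work inside each MPI process, and MPI calls are issued only outside the parallel region by the master thread (the “masteronly” style). This is a minimal illustration, not code from the paper; the vector length, the dot-product workload, and all identifiers are assumptions, and the process/thread placement depends on the MPI launcher and batch system.

```c
/* Minimal sketch of the hybrid masteronly style:
 * OpenMP inside the node, MPI between nodes, MPI calls only
 * outside parallel regions. Workload and sizes are illustrative. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_LOCAL 1000000   /* elements owned by each MPI process (assumed) */

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* FUNNELED is sufficient: only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double x[N_LOCAL], y[N_LOCAL];
    double local_dot = 0.0, global_dot = 0.0;

    /* Intra-node level: OpenMP threads share the process-local work. */
    #pragma omp parallel for reduction(+:local_dot)
    for (int i = 0; i < N_LOCAL; ++i) {
        x[i] = 1.0;
        y[i] = 2.0;
        local_dot += x[i] * y[i];
    }

    /* Inter-node level: the master thread combines the partial results. */
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot = %g on %d MPI processes x %d OpenMP threads\n",
               global_dot, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

Launched with, for example, one process per socket and OMP_NUM_THREADS set to the number of cores per socket, this roughly corresponds to one of the mixed models of Fig. 1 (b)–(d); with one process per core and OMP_NUM_THREADS=1 it degenerates to pure MPI.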
Similar resources
Using Hybrid Parallel Programming Techniques for the Computation, Assembly and Solution Stages in Finite Element Codes
The so-called “hybrid parallelism paradigm”, which combines programming techniques for architectures with distributed and shared memories using the MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) standards, is currently adopted to exploit the growing use of multi-core computers, thus improving the efficiency of codes on such architectures (several multi-core nodes or clustered sym...
Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Clusters
The NAS Parallel Benchmarks (NPB) are well-known applications with fixed algorithms for evaluating parallel systems and tools. Multicore clusters provide a natural programming paradigm for hybrid programs, whereby OpenMP can be used for data sharing among the cores that comprise a node and MPI can be used for communication between nodes. In this paper, we use the SP and BT benchma...
Comparing the OpenMP, MPI, and Hybrid Programming Paradigms on an SMP Cluster
Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming ...
Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
The parallelization process of nested-loop algorithms onto popular multi-level parallel architectures, such as clusters of SMPs, is not a trivial issue, since the existence of data dependencies in the algorithm imposes severe restrictions on the task decomposition to be applied. In this paper we propose three techniques for the parallelization of such algorithms, namely pure MPI parallelization,...
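For orientation only, the following generic sketch shows one common way a loop nest can be mapped onto a cluster of SMPs: the outer (row) dimension of a 2D grid is block-distributed over MPI processes, ghost rows are exchanged at the outer level, and the inner iterations of each local block are shared among OpenMP threads. The grid sizes, the Jacobi-style update, and the row decomposition are assumptions for illustration; this is not the specific technique proposed in the paper cited above.

```c
/* Generic hybrid nested-loop sketch: rows over MPI ranks, inner work
 * over OpenMP threads. Sizes and the update rule are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define NX 512          /* columns (assumed)                    */
#define LOCAL_NY 256    /* rows owned by each process (assumed) */

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Ghost rows at index 0 and LOCAL_NY+1 surround the local block. */
    double (*u)[NX]    = calloc(LOCAL_NY + 2, sizeof *u);
    double (*unew)[NX] = calloc(LOCAL_NY + 2, sizeof *unew);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int sweep = 0; sweep < 100; ++sweep) {
        /* Outer level: exchange ghost rows with neighboring ranks. */
        MPI_Sendrecv(u[1],            NX, MPI_DOUBLE, up,   0,
                     u[LOCAL_NY + 1], NX, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[LOCAL_NY],     NX, MPI_DOUBLE, down, 1,
                     u[0],            NX, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Inner level: the loop nest is shared among OpenMP threads. */
        #pragma omp parallel for
        for (int j = 1; j <= LOCAL_NY; ++j)
            for (int i = 1; i < NX - 1; ++i)
                unew[j][i] = 0.25 * (u[j - 1][i] + u[j + 1][i] +
                                     u[j][i - 1] + u[j][i + 1]);

        /* Swap the roles of the two arrays for the next sweep. */
        double (*tmp)[NX] = u; u = unew; unew = tmp;
    }

    free(u); free(unew);
    MPI_Finalize();
    return 0;
}
```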
Journal:
Volume / Issue:
Pages: -
Publication date: 2009